ROAD TRAFFIC ACCIDENT DATASET OF ADDIS ABABA CITY


🧾Description: This data set is collected from Addis Ababa Sub-city police departments for master's research work. The data set has been prepared from manual records of road traffic accidents of the year 2017-20. All the sensitive information has been excluded during data encoding and finally it has 32 features and 12316 instances of the accident. Then it is preprocessed and for identification of major causes of the accident by analyzing it using different machine learning classification algorithms.

source of dataset - Click Here

🧭 Problem Statement: The target feature is Accident_severity which is a multi-class variable. The task is to classify this variable based on the other 31 features step-by-step by going through each day's task. Your metric for evaluation will be f1-score


Libraries


In [ ]:
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline as py
import plotly.tools as tls
import seaborn as sns
import matplotlib
from collections import Counter
import joblib
import shap

from dataprep.eda import plot, plot_correlation, create_report, plot_missing
from IPython.display import display_html
from EDA import DataExplorer as de

pd.set_option('display.max_columns', None)

import warnings
warnings.filterwarnings('ignore')
matplotlib.style.use('seaborn')

Data


In [ ]:
df = pd.read_csv("data/RTA dataset.csv")
In [ ]:
display_html(df.head(2), df.tail(2), df.sample(2))
Time Day_of_week Age_band_of_driver Sex_of_driver Educational_level Vehicle_driver_relation Driving_experience Type_of_vehicle Owner_of_vehicle Service_year_of_vehicle Defect_of_vehicle Area_accident_occured Lanes_or_Medians Road_allignment Types_of_Junction Road_surface_type Road_surface_conditions Light_conditions Weather_conditions Type_of_collision Number_of_vehicles_involved Number_of_casualties Vehicle_movement Casualty_class Sex_of_casualty Age_band_of_casualty Casualty_severity Work_of_casuality Fitness_of_casuality Pedestrian_movement Cause_of_accident Accident_severity
0 17:02:00 Monday 18-30 Male Above high school Employee 1-2yr Automobile Owner Above 10yr No defect Residential areas NaN Tangent road with flat terrain No junction Asphalt roads Dry Daylight Normal Collision with roadside-parked vehicles 2 2 Going straight na na na na NaN NaN Not a Pedestrian Moving Backward Slight Injury
1 17:02:00 Monday 31-50 Male Junior high school Employee Above 10yr Public (> 45 seats) Owner 5-10yrs No defect Office areas Undivided Two way Tangent road with flat terrain No junction Asphalt roads Dry Daylight Normal Vehicle with vehicle collision 2 2 Going straight na na na na NaN NaN Not a Pedestrian Overtaking Slight Injury
Time Day_of_week Age_band_of_driver Sex_of_driver Educational_level Vehicle_driver_relation Driving_experience Type_of_vehicle Owner_of_vehicle Service_year_of_vehicle Defect_of_vehicle Area_accident_occured Lanes_or_Medians Road_allignment Types_of_Junction Road_surface_type Road_surface_conditions Light_conditions Weather_conditions Type_of_collision Number_of_vehicles_involved Number_of_casualties Vehicle_movement Casualty_class Sex_of_casualty Age_band_of_casualty Casualty_severity Work_of_casuality Fitness_of_casuality Pedestrian_movement Cause_of_accident Accident_severity
12314 13:55:00 Sunday 18-30 Female Junior high school Employee Above 10yr Lorry (41?100Q) Owner 2-5yrs No defect Office areas Undivided Two way Tangent road with mountainous terrain and No junction Asphalt roads Dry Darkness - lights lit Normal Vehicle with vehicle collision 2 1 Other na na na na Driver Normal Not a Pedestrian Driving under the influence of drugs Slight Injury
12315 13:55:00 Sunday 18-30 Male Junior high school Employee 5-10yr Other Owner 2-5yrs No defect Outside rural areas Undivided Two way Tangent road with mountainous terrain and O Shape Asphalt roads Dry Darkness - lights lit Normal Vehicle with vehicle collision 2 1 Stopping Pedestrian Female 5 3 Driver Normal Crossing from nearside - masked by parked or s... Changing lane to the right Slight Injury
Time Day_of_week Age_band_of_driver Sex_of_driver Educational_level Vehicle_driver_relation Driving_experience Type_of_vehicle Owner_of_vehicle Service_year_of_vehicle Defect_of_vehicle Area_accident_occured Lanes_or_Medians Road_allignment Types_of_Junction Road_surface_type Road_surface_conditions Light_conditions Weather_conditions Type_of_collision Number_of_vehicles_involved Number_of_casualties Vehicle_movement Casualty_class Sex_of_casualty Age_band_of_casualty Casualty_severity Work_of_casuality Fitness_of_casuality Pedestrian_movement Cause_of_accident Accident_severity
1045 13:55:00 Sunday 31-50 Male NaN NaN NaN NaN NaN NaN NaN Unknown other Tangent road with flat terrain Y Shape Asphalt roads Dry Daylight Normal Vehicle with vehicle collision 2 4 Going straight na na na na NaN NaN Not a Pedestrian No priority to vehicle Slight Injury
9368 8:00:00 Monday 18-30 Male Junior high school Employee Below 1yr Lorry (41?100Q) Owner 2-5yrs No defect Residential areas Double carriageway (median) Tangent road with flat terrain Y Shape Asphalt roads Dry Daylight Normal Rollover 1 1 Going straight na na na na NaN NaN Not a Pedestrian No priority to vehicle Slight Injury
In [ ]:
# ProfileReport(df)
# create_report(df)
In [ ]:
df.shape
Out[ ]:
(12316, 32)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Time                         12316 non-null  object
 1   Day_of_week                  12316 non-null  object
 2   Age_band_of_driver           12316 non-null  object
 3   Sex_of_driver                12316 non-null  object
 4   Educational_level            11575 non-null  object
 5   Vehicle_driver_relation      11737 non-null  object
 6   Driving_experience           11487 non-null  object
 7   Type_of_vehicle              11366 non-null  object
 8   Owner_of_vehicle             11834 non-null  object
 9   Service_year_of_vehicle      8388 non-null   object
 10  Defect_of_vehicle            7889 non-null   object
 11  Area_accident_occured        12077 non-null  object
 12  Lanes_or_Medians             11931 non-null  object
 13  Road_allignment              12174 non-null  object
 14  Types_of_Junction            11429 non-null  object
 15  Road_surface_type            12144 non-null  object
 16  Road_surface_conditions      12316 non-null  object
 17  Light_conditions             12316 non-null  object
 18  Weather_conditions           12316 non-null  object
 19  Type_of_collision            12161 non-null  object
 20  Number_of_vehicles_involved  12316 non-null  int64 
 21  Number_of_casualties         12316 non-null  int64 
 22  Vehicle_movement             12008 non-null  object
 23  Casualty_class               12316 non-null  object
 24  Sex_of_casualty              12316 non-null  object
 25  Age_band_of_casualty         12316 non-null  object
 26  Casualty_severity            12316 non-null  object
 27  Work_of_casuality            9118 non-null   object
 28  Fitness_of_casuality         9681 non-null   object
 29  Pedestrian_movement          12316 non-null  object
 30  Cause_of_accident            12316 non-null  object
 31  Accident_severity            12316 non-null  object
dtypes: int64(2), object(30)
memory usage: 3.0+ MB
In [ ]:
df.columns
Out[ ]:
Index(['Time', 'Day_of_week', 'Age_band_of_driver', 'Sex_of_driver',
       'Educational_level', 'Vehicle_driver_relation', 'Driving_experience',
       'Type_of_vehicle', 'Owner_of_vehicle', 'Service_year_of_vehicle',
       'Defect_of_vehicle', 'Area_accident_occured', 'Lanes_or_Medians',
       'Road_allignment', 'Types_of_Junction', 'Road_surface_type',
       'Road_surface_conditions', 'Light_conditions', 'Weather_conditions',
       'Type_of_collision', 'Number_of_vehicles_involved',
       'Number_of_casualties', 'Vehicle_movement', 'Casualty_class',
       'Sex_of_casualty', 'Age_band_of_casualty', 'Casualty_severity',
       'Work_of_casuality', 'Fitness_of_casuality', 'Pedestrian_movement',
       'Cause_of_accident', 'Accident_severity'],
      dtype='object')
In [ ]:
col_map={
    'Time': 'time',
    'Day_of_week': 'day_of_week',
    'Age_band_of_driver': 'driver_age',
    'Sex_of_driver': 'driver_sex',
    'Educational_level': 'educational_level',
    'Vehicle_driver_relation': 'vehicle_driver_relation',
    'Driving_experience': 'driving_experience',
    'Type_of_vehicle': 'vehicle_type',
    'Owner_of_vehicle': 'vehicle_owner',
    'Service_year_of_vehicle': 'service_year',
    'Defect_of_vehicle': 'vehicle_defect',
    'Area_accident_occured': 'accident_area',
    'Lanes_or_Medians': 'lanes',
    'Road_allignment': 'road_allignment',
    'Types_of_Junction': 'junction_type',
    'Road_surface_type': 'surface_type',
    'Road_surface_conditions': 'road_surface_conditions',
    'Light_conditions': 'light_condition',
    'Weather_conditions': 'weather_condition',
    'Type_of_collision': 'collision_type',
    'Number_of_vehicles_involved': 'vehicles_involved',
    'Number_of_casualties': 'casualties',
    'Vehicle_movement': 'vehicle_movement',
    'Casualty_class': 'casualty_class',
    'Sex_of_casualty': 'casualty_sex' , 
    'Age_band_of_casualty': 'casualty_age', 
    'Casualty_severity': 'casualty_severity',
    'Work_of_casuality': 'casualty_work', 
    'Fitness_of_casuality': 'casualty_fitness',
    'Pedestrian_movement': 'pedestrian_movement',
    'Cause_of_accident': 'accident_cause',
    'Accident_severity': 'accident_severity'
}
df.rename(columns=col_map, inplace=True)
In [ ]:
# calculate the missing values count and convert it to a dataframe
missing_vals = pd.DataFrame(df.isna().sum(), columns=['missing'])
In [ ]:
# concatenate the two dataframes
result = pd.concat([df.describe(include=['O']).T, missing_vals], axis=1)
In [ ]:
result
Out[ ]:
count unique top freq missing
time 12316 1074 15:30:00 120 0
day_of_week 12316 7 Friday 2041 0
driver_age 12316 5 18-30 4271 0
driver_sex 12316 3 Male 11437 0
educational_level 11575 7 Junior high school 7619 741
vehicle_driver_relation 11737 4 Employee 9627 579
driving_experience 11487 7 5-10yr 3363 829
vehicle_type 11366 17 Automobile 3205 950
vehicle_owner 11834 4 Owner 10459 482
service_year 8388 6 Unknown 2883 3928
vehicle_defect 7889 3 No defect 7777 4427
accident_area 12077 14 Other 3819 239
lanes 11931 7 Two-way (divided with broken lines road marking) 4411 385
road_allignment 12174 9 Tangent road with flat terrain 10459 142
junction_type 11429 8 Y Shape 4543 887
surface_type 12144 5 Asphalt roads 11296 172
road_surface_conditions 12316 4 Dry 9340 0
light_condition 12316 4 Daylight 8798 0
weather_condition 12316 9 Normal 10063 0
collision_type 12161 10 Vehicle with vehicle collision 8774 155
vehicle_movement 12008 13 Going straight 8158 308
casualty_class 12316 4 Driver or rider 4944 0
casualty_sex 12316 3 Male 5253 0
casualty_age 12316 6 na 4443 0
casualty_severity 12316 4 3 7076 0
casualty_work 9118 7 Driver 5903 3198
casualty_fitness 9681 5 Normal 9608 2635
pedestrian_movement 12316 9 Not a Pedestrian 11390 0
accident_cause 12316 20 No distancing 2263 0
accident_severity 12316 3 Slight Injury 10415 0
vehicles_involved NaN NaN NaN NaN 0
casualties NaN NaN NaN NaN 0
In [ ]:
de.generate_data_description_table(df)
Out[ ]:
  count mean std min 25% 50% 75% max
casualties 12316.000000 1.548149 1.007179 1.000000 1.000000 1.000000 2.000000 8.000000
vehicles_involved 12316.000000 2.040679 0.688790 1.000000 2.000000 2.000000 2.000000 7.000000
In [ ]:
print("Number of duplicates: ", df.duplicated().sum())
Number of duplicates:  0
In [ ]:
for i in df.columns:
    print(f"Unique value in {i}:")
    print(df[i].unique(),'\n')
Unique value in time:
['17:02:00' '1:06:00' '14:15:00' ... '7:24:00' '19:18:00' '2:47:00'] 

Unique value in day_of_week:
['Monday' 'Sunday' 'Friday' 'Wednesday' 'Saturday' 'Thursday' 'Tuesday'] 

Unique value in driver_age:
['18-30' '31-50' 'Under 18' 'Over 51' 'Unknown'] 

Unique value in driver_sex:
['Male' 'Female' 'Unknown'] 

Unique value in educational_level:
['Above high school' 'Junior high school' nan 'Elementary school'
 'High school' 'Unknown' 'Illiterate' 'Writing & reading'] 

Unique value in vehicle_driver_relation:
['Employee' 'Unknown' 'Owner' nan 'Other'] 

Unique value in driving_experience:
['1-2yr' 'Above 10yr' '5-10yr' '2-5yr' nan 'No Licence' 'Below 1yr'
 'unknown'] 

Unique value in vehicle_type:
['Automobile' 'Public (> 45 seats)' 'Lorry (41?100Q)' nan
 'Public (13?45 seats)' 'Lorry (11?40Q)' 'Long lorry' 'Public (12 seats)'
 'Taxi' 'Pick up upto 10Q' 'Stationwagen' 'Ridden horse' 'Other' 'Bajaj'
 'Turbo' 'Motorcycle' 'Special vehicle' 'Bicycle'] 

Unique value in vehicle_owner:
['Owner' 'Governmental' nan 'Organization' 'Other'] 

Unique value in service_year:
['Above 10yr' '5-10yrs' nan '1-2yr' '2-5yrs' 'Unknown' 'Below 1yr'] 

Unique value in vehicle_defect:
['No defect' nan '7' '5'] 

Unique value in accident_area:
['Residential areas' 'Office areas' '  Recreational areas'
 ' Industrial areas' nan 'Other' ' Church areas' '  Market areas'
 'Unknown' 'Rural village areas' ' Outside rural areas' ' Hospital areas'
 'School areas' 'Rural village areasOffice areas' 'Recreational areas'] 

Unique value in lanes:
[nan 'Undivided Two way' 'other' 'Double carriageway (median)' 'One way'
 'Two-way (divided with solid lines road marking)'
 'Two-way (divided with broken lines road marking)' 'Unknown'] 

Unique value in road_allignment:
['Tangent road with flat terrain' nan
 'Tangent road with mild grade and flat terrain' 'Escarpments'
 'Tangent road with rolling terrain' 'Gentle horizontal curve'
 'Tangent road with mountainous terrain and'
 'Steep grade downward with mountainous terrain' 'Sharp reverse curve'
 'Steep grade upward with mountainous terrain'] 

Unique value in junction_type:
['No junction' 'Y Shape' 'Crossing' 'O Shape' 'Other' 'Unknown' 'T Shape'
 'X Shape' nan] 

Unique value in surface_type:
['Asphalt roads' 'Earth roads' nan 'Asphalt roads with some distress'
 'Gravel roads' 'Other'] 

Unique value in road_surface_conditions:
['Dry' 'Wet or damp' 'Snow' 'Flood over 3cm. deep'] 

Unique value in light_condition:
['Daylight' 'Darkness - lights lit' 'Darkness - no lighting'
 'Darkness - lights unlit'] 

Unique value in weather_condition:
['Normal' 'Raining' 'Raining and Windy' 'Cloudy' 'Other' 'Windy' 'Snow'
 'Unknown' 'Fog or mist'] 

Unique value in collision_type:
['Collision with roadside-parked vehicles'
 'Vehicle with vehicle collision' 'Collision with roadside objects'
 'Collision with animals' 'Other' 'Rollover' 'Fall from vehicles'
 'Collision with pedestrians' 'With Train' 'Unknown' nan] 

Unique value in vehicles_involved:
[2 1 3 6 4 7] 

Unique value in casualties:
[2 1 3 4 6 5 8 7] 

Unique value in vehicle_movement:
['Going straight' 'U-Turn' 'Moving Backward' 'Turnover' 'Waiting to go'
 'Getting off' 'Reversing' 'Unknown' 'Parked' 'Stopping' 'Overtaking'
 'Other' 'Entering a junction' nan] 

Unique value in casualty_class:
['na' 'Driver or rider' 'Pedestrian' 'Passenger'] 

Unique value in casualty_sex:
['na' 'Male' 'Female'] 

Unique value in casualty_age:
['na' '31-50' '18-30' 'Under 18' 'Over 51' '5'] 

Unique value in casualty_severity:
['na' '3' '2' '1'] 

Unique value in casualty_work:
[nan 'Driver' 'Other' 'Unemployed' 'Employee' 'Self-employed' 'Student'
 'Unknown'] 

Unique value in casualty_fitness:
[nan 'Normal' 'Deaf' 'Other' 'Blind' 'NormalNormal'] 

Unique value in pedestrian_movement:
['Not a Pedestrian' "Crossing from driver's nearside"
 'Crossing from nearside - masked by parked or statioNot a Pedestrianry vehicle'
 'Unknown or other'
 'Crossing from offside - masked by  parked or statioNot a Pedestrianry vehicle'
 'In carriageway, statioNot a Pedestrianry - not crossing  (standing or playing)'
 'Walking along in carriageway, back to traffic'
 'Walking along in carriageway, facing traffic'
 'In carriageway, statioNot a Pedestrianry - not crossing  (standing or playing) - masked by parked or statioNot a Pedestrianry vehicle'] 

Unique value in accident_cause:
['Moving Backward' 'Overtaking' 'Changing lane to the left'
 'Changing lane to the right' 'Overloading' 'Other'
 'No priority to vehicle' 'No priority to pedestrian' 'No distancing'
 'Getting off the vehicle improperly' 'Improper parking' 'Overspeed'
 'Driving carelessly' 'Driving at high speed' 'Driving to the left'
 'Unknown' 'Overturning' 'Turnover' 'Driving under the influence of drugs'
 'Drunk driving'] 

Unique value in accident_severity:
['Slight Injury' 'Serious Injury' 'Fatal injury'] 

In [ ]:
# df = df.replace('na', np.nan)
In [ ]:
class_count = pd.value_counts(df['accident_severity'], sort = True).sort_index()
In [ ]:
fig = go.Figure(data=[go.Pie(labels=['Fatal injury', 'Serious Injury', 'Slight Injury'], 
                                    values=class_count, 
                                    pull=[0.1, 0, 0], 
                                    opacity=0.85)])
fig.update_layout(
    title_text="Composition of Accident Severity")
fig.show()
In [ ]:
fig = px.histogram(df, x="accident_severity",
                   title='Composition of Accident Severity',
                   opacity=0.85, # represent bars with log scale
                   color = 'accident_severity', text_auto=True)

fig.update_layout({
"plot_bgcolor": "rgba(0, 0, 0, 0)",
"paper_bgcolor": "rgba(0, 0, 0, 0)",
})

fig.show()
In [ ]:
fig = px.treemap(df, path=['accident_cause'], width=800, height=400)
fig.update_layout(
    margin = dict(t=0, l=0, r=0, b=50))

fig.show()
In [ ]:
# converting 'time' to datetime
df['time'] = pd.to_datetime(df['time'])

# date (day-month-year) time
df["time"].dt.hour

# extracting hour and minute from timestamp
df['hour'] = df['time'].dt.hour
df['minute'] = df['time'].dt.minute
df.drop('time', axis=1, inplace=True)
In [ ]:
plt.figure(figsize=(15,70))
plotnumber = 1

for col in df.drop(['hour', 'minute', 'lanes', 'road_allignment', 'pedestrian_movement'], axis=1):
    if plotnumber <= df.shape[1]:
        ax1 = plt.subplot(16,2,plotnumber)
        sns.countplot(data=df, y=col, palette='Dark2')
        plt.xticks(fontsize=12)
        plt.yticks(fontsize=12)
        plt.title(col.title(), fontsize=14)
        plt.xlabel('')
        plt.ylabel('')
    plotnumber +=1
plt.tight_layout()
In [ ]:
plt.figure(figsize=(10,3))
sns.countplot(data=df, y='lanes', palette = 'Dark2')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Lanes', fontsize=14)
plt.xlabel('')
plt.ylabel('')
plt.tight_layout()
In [ ]:
plt.figure(figsize=(10,3))
sns.countplot(data=df, y='road_allignment', palette = 'Dark2')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Road Allignment', fontsize=14)
plt.xlabel('')
plt.ylabel('')
plt.tight_layout()
In [ ]:
plt.figure(figsize=(10,5))
sns.countplot(data=df, y='pedestrian_movement', palette = 'Dark2')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Pedestrian Movement', fontsize=14)
plt.xlabel('')
plt.ylabel('')
plt.tight_layout()
In [ ]:
plt.figure(figsize=(10,5))
sns.countplot(data=df, y='hour')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Hour', fontsize=14)
plt.xlabel('')
plt.ylabel('')
plt.tight_layout()
In [ ]:
plt.figure(figsize=(10,15))
sns.countplot(data=df, y='minute')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Minute', fontsize=14)
plt.xlabel('')
plt.ylabel('')
plt.tight_layout()

💡 Observations:

Most of the accidents:

  • involved 2 vehicles and 2 casualties
  • occured on Fridays and after noon hours

Most of the drivers:

  • are male and in 18-30 yrs age group
  • have only went upto Junior high school and are employees
  • have 5-10 yrs of driving experience

Most of the accidents happened with personally owned passenger vehicles.

Most of the drivers have met with accident on:

  • two-way lanes
  • tangent road with flat terrains
  • Y shaped junctions
  • asphalt roads
  • day time
  • normal weather conditions

Most of the casualties:

  • happened to physically fit male drivers
  • are of severity 3

The conditions on which most of the drivers met with the accident are:

  • vehicle to vehicle collision
  • straight movement
  • no pedestrian accidents

Not keeping enough distance between the vehicles was the major cause for most of the accidents and majority of the accidents resulted in slight injury.

💡 Observations:

  • Most of the accidents have taken place between 3pm to 6pm with maximum numbers at 5 pm
  • Most of the time in minutes seems to have been recorded in intervals of 5 minutes for the sake of convenience. Hence more numbers to be seen at the 0 and 30 minutes.
In [ ]:
min = list(range(5,56, 5))
def convert_minutes(x: int):
    for m in min:
        if x % m == x and x > m-5:
            return m
        if x in [56,57,58,59]:
            return 0
        if x in min+[0]:
            return x
In [ ]:
df['minute'] = df['minute'].apply(lambda x: convert_minutes(x))
In [ ]:
plt.figure(figsize=(5,7))
sns.countplot(data=df, y='minute')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Minute', fontsize=14)
plt.xlabel('')
plt.ylabel('')
plt.tight_layout()

💡 Observations:

  • Most of the accidents with fatal injuries happened between 2pm to 7pm.
  • Most of the accidents with fatal injuries happened on weekends.
  • Highest number of non-fatal injuries happened at 5pm.
  • Highest number of non-fatal injuries happened on fridays.
  • Most accidents are caused by drivers aged 18-30 and the least by drivers aged under 18.
  • Proportion of fatal accidents are lower for female drivers.
  • Drivers with 2-5yrs of experience caused most accidents with fatal injury and those with 5-10yrs experience caused most accidents with non-fatal injuries.
  • Proportion of fatal injuries caused by vehicles with more than 10yrs of service is lower compared to non-fatal injuries.
  • Though most of the accident happened around offices, a higher proportion of accidents happened around residential areas have led to fatal injuries.
  • Most accidents with fatal injuries occured on undivided two-ways.
  • Most accidents with non-fatal injuries occured on two-way divided with broken lines road marking
  • Double carriageway has a lower proportion of fatal accidents compared to non-fatal accidents.
  • Severity of accidents increases at places with no junctions and the most number of fatal injuries occured at places with no junctions.
  • Crossings and Y-shaped junctions shows a decreasing trend with respect to severity of accidents though the numbers are high.
  • Most of the accidents with fatal injury have happened at night.
  • Collisions with pedestrians have resulted in more fatal injuries compared to non-fatal injuries.
  • Accidents with 4 casualties have a huge proportion on fatal injuries compared to non-fatal injuries.
  • Accidents involving drivers with 18-30yrs of age have an increasing pattern with respect to severity of injuries.
  • Moving backward led to most of the accidents with fatal injuries whereas failing to keep enough distance between vehicles led to more number of accidents with non-fatal injuries.

Testing Hypothesis¶

In [ ]:
hypothesis_df = df.copy(deep=True)
❗️The copy() method performs a shallow copy, which means that the new object references the same data as the original object. This can cause issues if the original object is modified, as the changes will also affect the new object.

To avoid this issue, the deep=True parameter is used to perform a deep copy, which means that a completely new object is created with its own data. This ensures that any changes made to the original object do not affect the new object.

Hypothesis 1: Which of both genders is actually dangerous in Driving?

Calculating the Conditional Probability; Probability of an Accident Happen When the Driver is a Female and Repeating the Calculations when it's Male.
P(Gender = Male | Severity = 2) P(Gender = Female | Severity = 2)

In [ ]:
Counter(df["driver_sex"])
Out[ ]:
Counter({'Male': 11437, 'Female': 701, 'Unknown': 178})
In [ ]:
((hypothesis_df.groupby(['driver_sex']).size() / hypothesis_df["driver_sex"].count()) * 100).add_prefix('Accidents(in %) Caused by ')
Out[ ]:
driver_sex
Accidents(in %) Caused by Female      5.691783
Accidents(in %) Caused by Male       92.862943
Accidents(in %) Caused by Unknown     1.445274
dtype: float64
In [ ]:
male_df = hypothesis_df.loc[hypothesis_df.driver_sex == 'Male']
female_df = hypothesis_df.loc[hypothesis_df.driver_sex == 'Female']

fig = go.Figure()

fig.add_trace(go.Bar(x=male_df.accident_cause.unique(), y=male_df.accident_cause.value_counts().values, 
                     name='Male'))
fig.add_trace(go.Bar(x=female_df.accident_cause.unique(), y=female_df.accident_cause.value_counts().values, 
                     name='Female'))

fig.update_layout(title='Accident Cause by Driver Sex',
                  xaxis_title='Accident Cause', yaxis_title='Count',
                  barmode='group', bargap=0.1, bargroupgap=0.2)

fig.show()
In [ ]:
injury_by_sex = pd.crosstab(index=hypothesis_df['driver_sex'].loc[hypothesis_df['driver_sex'] !="Unknown"], columns=df['accident_severity']
                 , margins=True)
injury_by_sex
Out[ ]:
accident_severity Fatal injury Serious Injury Slight Injury All
driver_sex
Female 5 104 592 701
Male 152 1621 9664 11437
All 157 1725 10256 12138

$P(Fatal | Male) = P(Fatal | Female)?$

In [ ]:
print("P(Fatal | Female) = {:.2f}%".format(injury_by_sex.iloc[0,0] / injury_by_sex.iloc[0, 3]*100))
print("P(Fatal | Male) = {:.2f}%\n".format(injury_by_sex.iloc[1,0] / injury_by_sex.iloc[1, 3]*100))
P(Fatal | Female) = 0.71%
P(Fatal | Male) = 1.33%

$\therefore P(Fatal | Male) \approx 2P(Fatal | Female)$

In [ ]:
print("P(Serious | Female) = {:.2f}%".format(injury_by_sex.iloc[0,1] / injury_by_sex.iloc[0, 3]*100))
print("P(Serious | Male) = {:.2f}%\n".format(injury_by_sex.iloc[1,1] / injury_by_sex.iloc[1, 3]*100))
P(Serious | Female) = 14.84%
P(Serious | Male) = 14.17%

In [ ]:
print("P(Slight | Female) = {:.2f}%".format(injury_by_sex.iloc[0,2] / injury_by_sex.iloc[0, 3]*100))
print("P(Slight | Male) = {:.2f}%\n".format(injury_by_sex.iloc[1,2] / injury_by_sex.iloc[1, 3]*100))
P(Slight | Female) = 84.45%
P(Slight | Male) = 84.50%

In [ ]:
 
In [ ]:
print("P(Gender = Female | Severity = Fatal) = {:.2f}%".format(injury_by_sex.iloc[0,0]/injury_by_sex.iloc[2,0]*100))
print("P(Gender = Male   | Severity = Fatal) = {:.2f}%\n".format(injury_by_sex.iloc[1,0]/injury_by_sex.iloc[2, 0]*100))

print("P(Gender = Female | Severity = Serious Injury) = {:.2f}%".format(injury_by_sex.iloc[0,1]/injury_by_sex.iloc[2,1]*100))
print("P(Gender = Male   | Severity = Serious Injury) = {:.2f}%\n".format(injury_by_sex.iloc[1,1]/injury_by_sex.iloc[2,1]*100))

print("P(Gender = Female | Severity = Slight Injury) = {:.2f}%".format(injury_by_sex.iloc[0,2]/injury_by_sex.iloc[2,2]*100))
print("P(Gender = Male   | Severity = Slight Injury) = {:.2f}%".format(injury_by_sex.iloc[1,2]/injury_by_sex.iloc[2,2]*100))
P(Gender = Female | Severity = Fatal) = 3.18%
P(Gender = Male   | Severity = Fatal) = 96.82%

P(Gender = Female | Severity = Serious Injury) = 6.03%
P(Gender = Male   | Severity = Serious Injury) = 93.97%

P(Gender = Female | Severity = Slight Injury) = 5.77%
P(Gender = Male   | Severity = Slight Injury) = 94.23%

Hypothesis 2: Is Speeding is the main Cause for Car Crashes?

In [ ]:
accident_cause_by_injury = pd.crosstab(index=hypothesis_df['accident_cause'], columns=hypothesis_df['accident_severity'], margins=True)
accident_cause_by_injury #.sort_values(by = ["All"], ascending = False)
Out[ ]:
accident_severity Fatal injury Serious Injury Slight Injury All
accident_cause
Changing lane to the left 16 206 1251 1473
Changing lane to the right 23 260 1525 1808
Driving at high speed 2 31 141 174
Driving carelessly 22 209 1171 1402
Driving to the left 4 53 227 284
Driving under the influence of drugs 5 46 289 340
Drunk driving 0 3 24 27
Getting off the vehicle improperly 3 29 165 197
Improper parking 1 2 22 25
Moving Backward 26 162 949 1137
No distancing 20 303 1940 2263
No priority to pedestrian 5 95 621 721
No priority to vehicle 13 149 1045 1207
Other 7 64 385 456
Overloading 2 10 47 59
Overspeed 1 15 45 61
Overtaking 4 75 351 430
Overturning 2 23 124 149
Turnover 2 6 70 78
Unknown 0 2 23 25
All 158 1743 10415 12316
In [ ]:
accidents_by_high_speed = (accident_cause_by_injury.loc["Overspeed", "All"] + 
                           accident_cause_by_injury.loc["Driving at high speed", "All"])

all_accidents = accident_cause_by_injury.loc["All", "All"]

print(accidents_by_high_speed)
print(all_accidents)
235
12316
In [ ]:
print("Injuries(all types due to speeding): " + str(accidents_by_high_speed))
print("In percentage: {:.2f}%".format(accidents_by_high_speed/all_accidents*100))
Injuries(all types due to speeding): 235
In percentage: 1.91%

💡 Observations:

  • Speeding is not the main factor for car accidents.

Hypothesis 3: Do weekends account for more accidents?

In [ ]:
injury_by_day = pd.crosstab(index=hypothesis_df['day_of_week'], columns=hypothesis_df['accident_severity'], margins=True)
injury_by_day
Out[ ]:
accident_severity Fatal injury Serious Injury Slight Injury All
day_of_week
Friday 16 313 1712 2041
Monday 12 204 1465 1681
Saturday 37 245 1384 1666
Sunday 35 190 1242 1467
Thursday 22 272 1557 1851
Tuesday 17 257 1496 1770
Wednesday 19 262 1559 1840
All 158 1743 10415 12316
In [ ]:
print('Accident percentage on weekdays:',round(sum([injury_by_day.iloc[i,3] for i in range(0,7) if injury_by_day.index[i] not in ['Saturday','Sunday']])/injury_by_day.iloc[7,3],2))
print('Accident percentage on weekends:',round(sum([injury_by_day.iloc[i,3] for i in range(0,7) if injury_by_day.index[i] in ['Saturday','Sunday']])/injury_by_day.iloc[7,3],2))
Accident percentage on weekdays: 0.75
Accident percentage on weekends: 0.25

💡 Observations:

  • Its quite clear that there is no indication of more accidents happening on weekends

Hypothesis 4: Are there more dangerous injuries at night than in day?

In [ ]:
injury_by_light_condition = pd.crosstab(index=hypothesis_df['light_condition'], columns=hypothesis_df['accident_severity'], margins=True)
injury_by_light_condition
Out[ ]:
accident_severity Fatal injury Serious Injury Slight Injury All
light_condition
Darkness - lights lit 66 465 2755 3286
Darkness - lights unlit 0 7 33 40
Darkness - no lighting 5 49 138 192
Daylight 87 1222 7489 8798
All 158 1743 10415 12316
In [ ]:
sum(injury_by_light_condition.iloc[0:3, 0] + injury_by_light_condition.iloc[0:3, 1])
Out[ ]:
592
In [ ]:
sol = 0
for i in [0,1]:
    for j in [0,1,2]:
        sol += injury_by_light_condition.iloc[j,i]

r = sol/(sum(injury_by_light_condition.iloc[i,3] for i in [0,1,2]))
print('Dangerous injuries at night:',round(r,2),'%')

r = sum(injury_by_light_condition.iloc[i,2] for i in [0,1,2])/sum(injury_by_light_condition.iloc[i,3] for i in [0,1,2])
print('Slight injuries at night:',round(r,2),'%')
Dangerous injuries at night: 0.17 %
Slight injuries at night: 0.83 %

💡 Observations:

  • Although it does that in daylight there are more number of accidents but the dangerous injuries percentage are almost same either be it day or night.

Hypothesis 5: More accidents under some specific non-normal weather?

Without considering normal weather

In [ ]:
injury_by_weather = pd.crosstab(index=hypothesis_df['weather_condition'], columns=hypothesis_df['accident_severity'], margins=True)
injury_by_weather.drop(['Normal','All'],axis=0)
Out[ ]:
accident_severity Fatal injury Serious Injury Slight Injury All
weather_condition
Cloudy 0 8 117 125
Fog or mist 0 1 9 10
Other 0 28 268 296
Raining 23 158 1150 1331
Raining and Windy 0 2 38 40
Snow 0 5 56 61
Unknown 0 51 241 292
Windy 0 16 82 98

💡 Observations:

  • Without any analysis we can clearly see that rainy weather causes more accident and also all the fatal injuries have occured under non-normal conditions are there in rainy weather

⚙️ Preprocessing¶

In [ ]:
df.isna().sum()
Out[ ]:
day_of_week                   0
driver_age                    0
driver_sex                    0
educational_level           741
vehicle_driver_relation     579
driving_experience          829
vehicle_type                950
vehicle_owner               482
service_year               3928
vehicle_defect             4427
accident_area               239
lanes                       385
road_allignment             142
junction_type               887
surface_type                172
road_surface_conditions       0
light_condition               0
weather_condition             0
collision_type              155
vehicles_involved             0
casualties                    0
vehicle_movement            308
casualty_class                0
casualty_sex                  0
casualty_age                  0
casualty_severity             0
casualty_work              3198
casualty_fitness           2635
pedestrian_movement           0
accident_cause                0
accident_severity             0
hour                          0
minute                        0
dtype: int64
In [ ]:
def fill_na_with_distribution(df, cols):
    for col in cols:
        dist = df[col].value_counts(normalize=True)
        missing = df[col].isnull()
        if missing.sum() > 0:
            df.loc[missing, col] = np.random.choice(dist.index, size=missing.sum(), p=dist.values)
    return df
In [ ]:
df = fill_na_with_distribution(df, ['casualty_class', 'casualty_sex', 'casualty_age', 'casualty_severity'])
In [ ]:
# dropping columns that can cause imbalance while imputation
df.drop(columns = ['vehicle_defect', 'vehicle_driver_relation', 'casualty_work', 'casualty_fitness'], inplace=True)
In [ ]:
impute_cols = [x for x in df.isna().sum()[df.isna().sum() != 0].index.tolist()]
for feat in impute_cols:
    mode = df[feat].mode()[0]
    df[feat].fillna(mode, inplace=True)
In [ ]:
df.isna().sum() 
Out[ ]:
day_of_week                0
driver_age                 0
driver_sex                 0
educational_level          0
driving_experience         0
vehicle_type               0
vehicle_owner              0
service_year               0
accident_area              0
lanes                      0
road_allignment            0
junction_type              0
surface_type               0
road_surface_conditions    0
light_condition            0
weather_condition          0
collision_type             0
vehicles_involved          0
casualties                 0
vehicle_movement           0
casualty_class             0
casualty_sex               0
casualty_age               0
casualty_severity          0
pedestrian_movement        0
accident_cause             0
accident_severity          0
hour                       0
minute                     0
dtype: int64

⛓ Preprocessing¶

In [ ]:
def ordinal_encoder(df, feats): 
    for feat in feats:    
        feat_val = list(np.arange(df[feat].nunique()))
        feat_key = list(df[feat].sort_values().unique())
        feat_dict = dict(zip(feat_key, feat_val))
        df[feat] = df[feat].map(feat_dict)
    return df

df = ordinal_encoder(df, df.drop(['accident_severity'], axis=1).columns)
df.shape
Out[ ]:
(12316, 29)
In [ ]:
display_html(df.head(2), df.tail(2), df.sample(2))
day_of_week driver_age driver_sex educational_level driving_experience vehicle_type vehicle_owner service_year accident_area lanes road_allignment junction_type surface_type road_surface_conditions light_condition weather_condition collision_type vehicles_involved casualties vehicle_movement casualty_class casualty_sex casualty_age casualty_severity pedestrian_movement accident_cause accident_severity hour minute
0 1 0 1 0 0 0 3 3 9 2 5 1 0 0 3 2 3 1 1 2 3 2 5 3 5 9 Slight Injury 17 1
1 1 1 1 4 3 11 3 2 6 4 5 1 0 0 3 2 8 1 1 2 3 2 5 3 5 16 Slight Injury 17 1
day_of_week driver_age driver_sex educational_level driving_experience vehicle_type vehicle_owner service_year accident_area lanes road_allignment junction_type surface_type road_surface_conditions light_condition weather_condition collision_type vehicles_involved casualties vehicle_movement casualty_class casualty_sex casualty_age casualty_severity pedestrian_movement accident_cause accident_severity hour minute
12314 3 0 0 4 3 5 3 1 6 4 7 1 0 0 0 2 8 1 0 4 3 2 5 3 5 5 Slight Injury 13 11
12315 3 0 1 4 2 7 3 1 5 4 7 2 0 0 0 2 8 1 0 8 2 0 2 2 1 1 Slight Injury 13 11
day_of_week driver_age driver_sex educational_level driving_experience vehicle_type vehicle_owner service_year accident_area lanes road_allignment junction_type surface_type road_surface_conditions light_condition weather_condition collision_type vehicles_involved casualties vehicle_movement casualty_class casualty_sex casualty_age casualty_severity pedestrian_movement accident_cause accident_severity hour minute
3969 4 4 0 1 2 0 0 1 7 6 5 1 0 0 3 7 8 1 0 4 0 1 1 1 5 10 Slight Injury 15 9
11228 3 1 1 6 1 5 3 5 7 6 5 7 0 0 3 2 8 1 2 2 1 0 3 2 5 12 Slight Injury 10 4
In [ ]:
plt.figure(figsize=(22,17))
sns.set(font_scale=0.8)
sns.heatmap(df.corr(), annot=True, cmap=plt.cm.CMRmap_r)
Out[ ]:
<AxesSubplot: >

Upsampling¶

In [ ]:
# Importing necessary libraries
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import (
    cross_val_score,
    StratifiedKFold,
    KFold,
    train_test_split,
    GridSearchCV,
)
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    recall_score,
    precision_score,
    f1_score,
    confusion_matrix,
)
from mlxtend.evaluate import mcnemar_table, mcnemar_tables
from mlxtend.plotting import checkerboard_plot, plot_decision_regions
from xgboost import XGBClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
In [ ]:
X = df.drop('accident_severity', axis=1)
y = df['accident_severity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(8621, 28) (3695, 28) (8621,) (3695,)
In [ ]:
# upsampling using smote

counter = Counter(y_train)
counter
Out[ ]:
Counter({'Slight Injury': 7324, 'Serious Injury': 1191, 'Fatal injury': 106})
In [ ]:
# upsampling using smote

counter = Counter(y_train)

print("=============================")

for k,v in counter.items():
    per = 100*v/len(y_train)
    print(f"Class= {k}, n={v} ({per:.2f}%)")

oversample = SMOTE()
X_train, y_train = oversample.fit_resample(X_train, y_train)

counter = Counter(y_train)

print("=============================")

for k,v in counter.items():
    per = 100*v/len(y_train)
    print(f"Class= {k}, n={v} ({per:.2f}%)")

print("=============================")

print("Upsampled data shape: ", X_train.shape, y_train.shape)
=============================
Class= Slight Injury, n=7324 (84.96%)
Class= Serious Injury, n=1191 (13.82%)
Class= Fatal injury, n=106 (1.23%)
=============================
Class= Slight Injury, n=7324 (33.33%)
Class= Serious Injury, n=7324 (33.33%)
Class= Fatal injury, n=7324 (33.33%)
=============================
Upsampled data shape:  (21972, 28) (21972,)
In [ ]:
y_test = ordinal_encoder(pd.DataFrame(y_test, columns = ['accident_severity']), pd.DataFrame(y_test, columns = ['accident_severity']).columns)['accident_severity']
y_train = ordinal_encoder(pd.DataFrame(y_train, columns = ['accident_severity']), pd.DataFrame(y_train, columns = ['accident_severity']).columns)['accident_severity']

🧰 Baseline Modelling¶

In [ ]:
def modelling(X_train, y_train, X_test, y_test, **kwargs):
    scores = {}
    models = []
    bvd = {}
    if 'xgb' in kwargs.keys() and kwargs['xgb']:
        xgb = XGBClassifier()
        xgb.fit(X_train._get_numeric_data(), np.ravel(y_train, order='C'))
        y_pred = xgb.predict(X_test._get_numeric_data())
        scores['xgb']= [accuracy_score(y_test, y_pred)]
        models.append(xgb)


    if 'rf' in kwargs.keys() and kwargs['rf']:
        rf = RandomForestClassifier(n_estimators=200)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_test)
        scores['rf']= [accuracy_score(y_test, y_pred)]
        models.append(rf)

    if 'extree' in kwargs.keys() and kwargs['extree']:
        extree = ExtraTreesClassifier()
        extree.fit(X_train, y_train)
        y_pred = extree.predict(X_test)
        scores['extree'] = [accuracy_score(y_test, y_pred)]
        models.append(extree)
        
    return scores,models
In [ ]:
scores,models = modelling(X_train,y_train, X_test, y_test, xgb=True, rf=True, extree=True)
print(scores)
{'xgb': [0.7948579161028417], 'rf': [0.7964817320703653], 'extree': [0.8056833558863329]}
In [ ]:
y_pred_xgb = models[0].predict(X_test)
y_pred_rf = models[1].predict(X_test)
y_pred_ext = models[2].predict(X_test)


tb = mcnemar_tables(y_test, 
                    y_pred_xgb,
                    y_pred_rf, 
                    y_pred_ext)

brd = checkerboard_plot(tb['model_0 vs model_1'],
                        figsize=(5, 5),
                        fmt='%d',
                        col_labels=['model rf wrong', 'model rf right'],
                        row_labels=['model xgb wrong', 'model xgb right'])
plt.show()

brd = checkerboard_plot(tb['model_0 vs model_2'],
                        figsize=(5, 5),
                        fmt='%d',
                        col_labels=['model ext wrong', 'model ext right'],
                        row_labels=['model xgb wrong', 'model xgb right'])
plt.show()

brd = checkerboard_plot(tb['model_1 vs model_2'],
                        figsize=(5, 5),
                        fmt='%d',
                        col_labels=['model ext wrong', 'model ext right'],
                        row_labels=['model rf wrong', 'model rf right'])
plt.show()
In [ ]:
def model_performance(model, y_test, y_hat) : 
    conf_matrix = confusion_matrix(y_test, y_hat)
    trace1 = go.Heatmap(z = conf_matrix  ,x = ["0 (pred)","1 (pred)", "2 (pred)"],
                        y = ["0 (true)","1 (true)", "2 (true)"],xgap = 2, ygap = 2, 
                        colorscale = 'Viridis', showscale  = False)

    #Show metrics
    Accuracy  =  accuracy_score(y_test, y_hat)
    Precision =  precision_score(y_test, y_pred, average= 'weighted')
    Recall    =  recall_score(y_test, y_pred, average= 'weighted')
    F1_score  =  f1_score(y_test, y_pred, average= 'weighted')

    show_metrics = pd.DataFrame(data=[[Accuracy , Precision, Recall, F1_score]])
    show_metrics = show_metrics.T

    colors = ['gold', 'lightgreen', 'lightcoral', 'lightskyblue']
    trace2 = go.Bar(x = (show_metrics[0].values), 
                   y = ['Accuracy', 'Precision', 'Recall', 'F1_score'], text = np.round_(show_metrics[0].values,4),
                    textposition = 'auto',
                   orientation = 'h', opacity = 0.8,marker=dict(
            color=colors,
            line=dict(color='#000000',width=1.5)))
 
     
    #plots
    model = model
    
    #Subplots
    fig = tls.make_subplots(rows=2, cols=1, print_grid=False, 
                          subplot_titles=('Confusion Matrix',
                                        'Metrics',
                                        ))
    
    fig.append_trace(trace1,1,1)
    fig.append_trace(trace2,2,1)
    
    fig['layout'].update(showlegend = False, title = '<b>Model performance report</b><br>'+str(model),
                        autosize = True, height = 800,width = 800,
                        plot_bgcolor = 'rgba(240,240,240, 0.95)',
                        paper_bgcolor = 'rgba(240,240,240, 0.95)',
                        # margin = dict(b = 100)
                        )
    fig.layout.titlefont.size = 14
    
    py.iplot(fig,filename='model-performance')
In [ ]:
extree = ExtraTreesClassifier()
extree.fit(X_train, y_train)
y_pred = extree.predict(X_test)
In [ ]:
extree.get_params()
Out[ ]:
{'bootstrap': False,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
In [ ]:
model_performance(extree,y_test, y_pred)

🛠️ Hyperparameter Tuning¶

In [ ]:
gkf = KFold(n_splits=3, shuffle=True, random_state=42).split(X=X_train, y=y_train)

# A parameter grid for ETrees
params = {
    'n_estimators': range(100, 500, 100),  # [100,200,300,400,500]
    'ccp_alpha': [0.0, 0.1],
    'criterion': ['gini'],
    'max_depth': [5,11],
    'min_samples_split': [2,3],
}

extree_estimator = ExtraTreesClassifier()

gsearch = GridSearchCV(
    estimator= extree_estimator,
    param_grid= params,
    scoring='f1_weighted',
    n_jobs=-1,
    cv=gkf,
    verbose=1,
)


extree_model = gsearch.fit(X=X_train, y=y_train)
(gsearch.best_params_, gsearch.best_score_)
Fitting 3 folds for each of 32 candidates, totalling 96 fits
Out[ ]:
({'ccp_alpha': 0.0,
  'criterion': 'gini',
  'max_depth': 11,
  'min_samples_split': 2,
  'n_estimators': 300},
 0.8647603875807635)
In [ ]:
gkf2 = KFold(n_splits=3, shuffle=True, random_state=101).split(X=X_train, y=y_train)

params2 = {
    'max_depth': [11,15],
    'min_samples_split': [2,3],
    'class_weight': ['balanced', None],

}

extree2 = ExtraTreesClassifier(ccp_alpha = 0.0,
                                criterion = 'gini',
                                max_depth = 11,
                                min_samples_split = 3)

gsearch2 = GridSearchCV(
    estimator= extree2,
    param_grid= params2,
    scoring='f1_weighted',
    n_jobs=-1,
    cv=gkf2,
    verbose=3,
)

extree_model2 = gsearch2.fit(X=X_train, y=y_train)

(gsearch2.best_params_, gsearch2.best_score_)
Fitting 3 folds for each of 8 candidates, totalling 24 fits
Out[ ]:
({'class_weight': None, 'max_depth': 15, 'min_samples_split': 2},
 0.9203765783132133)
In [ ]:
extree_tuned = ExtraTreesClassifier(ccp_alpha = 0.0,
                                criterion = 'gini',
                                min_samples_split = 2,
                                class_weight = 'balanced',
                                max_depth = 15,
                                n_estimators = 400)

extree_tuned.fit(X_train, y_train)
Out[ ]:
ExtraTreesClassifier(class_weight='balanced', max_depth=15, n_estimators=400)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
ExtraTreesClassifier(class_weight='balanced', max_depth=15, n_estimators=400)
In [ ]:
y_pred_tuned = extree_tuned.predict(X_test)
In [ ]:
print(extree_tuned.feature_importances_)
[0.0565462  0.05789308 0.01420086 0.02952284 0.04402301 0.03613448
 0.0199071  0.03517829 0.02860482 0.03771439 0.01405304 0.0453975
 0.01112801 0.04420206 0.06420246 0.0178615  0.03507673 0.0687727
 0.05905148 0.02040236 0.02924959 0.03142726 0.02946688 0.02458844
 0.01077821 0.04141331 0.04219812 0.05100529]
In [ ]:
feat_importances = pd.Series(extree_tuned.feature_importances_, index=X.columns)
plt.figure(figsize=(12,12))
myexplode = [0.12,0,0,0,0,0,0,0,0,0]
plt.pie(feat_importances.nlargest(10),labels=feat_importances.nlargest(10).index, autopct='%.0f%%',explode= myexplode,
        textprops={'fontsize': 16})
centre_circle = plt.Circle((0,0),0.10,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Important features considered by Extra Trees Classifier',fontsize=20)
plt.show()

Explainable AI¶

In [ ]:
shap.initjs()
In [ ]:
X_sample = X_train.sample(100)
X_sample
Out[ ]:
day_of_week driver_age driver_sex educational_level driving_experience vehicle_type vehicle_owner service_year accident_area lanes road_allignment junction_type surface_type road_surface_conditions light_condition weather_condition collision_type vehicles_involved casualties vehicle_movement casualty_class casualty_sex casualty_age casualty_severity pedestrian_movement accident_cause hour minute
10697 3 1 1 4 0 0 3 0 8 5 3 7 0 1 0 2 8 1 1 2 1 0 0 2 5 3 18 5
303 0 1 1 4 4 0 3 3 9 6 5 1 0 0 3 2 1 1 0 2 0 1 3 2 5 12 9 9
5495 0 0 1 4 2 4 3 5 6 4 5 7 0 0 3 2 8 2 0 0 3 2 5 3 5 1 17 4
10431 2 0 1 3 1 6 2 5 4 4 5 1 0 0 0 2 8 1 2 3 0 0 0 2 5 2 20 2
18118 3 0 1 4 3 0 2 4 6 0 5 1 3 0 3 2 8 0 0 2 3 2 5 3 5 12 15 10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4105 5 0 1 0 2 0 3 2 6 2 5 7 0 0 3 2 8 1 1 2 3 2 5 3 5 16 7 10
15682 5 2 1 2 1 8 3 3 5 2 5 6 0 0 2 2 8 0 0 2 3 2 5 3 5 9 10 8
14573 3 0 1 4 1 0 3 5 6 4 5 1 0 2 3 3 0 1 2 4 0 1 0 2 5 4 15 4
4813 1 0 1 4 1 15 3 0 9 2 5 7 0 0 3 2 8 2 0 2 1 1 0 2 5 0 11 10
7394 2 4 1 4 3 5 3 2 7 4 5 1 0 3 0 4 8 2 0 2 3 2 5 3 5 10 22 0

100 rows × 28 columns

In [ ]:
shap_values = shap.TreeExplainer(extree_tuned).shap_values(X_sample)
In [ ]:
shap.summary_plot(shap_values, X_sample, plot_type="bar")
In [ ]:
shap.force_plot(shap.TreeExplainer(extree_tuned).expected_value[0],
                shap_values[0][:], 
                X_sample)
Out[ ]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [ ]:
print(y_pred_tuned[50])
shap.force_plot(shap.TreeExplainer(extree_tuned).expected_value[0], shap_values[1][50], X_sample.iloc[50])
2
Out[ ]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [ ]:
i=13
print(y_pred_tuned[i])
shap.force_plot(shap.TreeExplainer(extree_tuned).expected_value[0], shap_values[0][i], X_sample.values[i], feature_names = X_sample.columns)
1
Out[ ]:
Visualization omitted, Javascript library not loaded!
Have you run `initjs()` in this notebook? If this notebook was from another user you must also trust this notebook (File -> Trust notebook). If you are viewing this notebook on github the Javascript has been stripped for security. If you are using JupyterLab this error is because a JupyterLab extension has not yet been written.
In [ ]:
print(y_pred_tuned[10])
row = 10
shap.waterfall_plot(shap.Explanation(values=shap_values[0][row], 
                                              base_values=shap.TreeExplainer(extree_tuned).expected_value[0], data=X_sample.iloc[row],  
                                         feature_names=X_sample.columns.tolist()))
2
In [ ]:
shap.dependence_plot('day_of_week', shap_values[2], X_sample)
In [ ]:
shap.dependence_plot('driver_age', shap_values[2], X_sample)
In [ ]:
print(y_pred_tuned[10])
shap.decision_plot(shap.TreeExplainer(extree_tuned).expected_value[0], 
                   shap_values[2][:10], 
                   feature_names=X_sample.columns.tolist())
2
In [ ]:
 
In [ ]:
 
In [ ]:
# upsampling using smote
y = df["accident_severity"]
X = df.drop(["accident_severity"], axis=1)

counter = Counter(y)

for k, v in counter.items():
    dist = v / len(y) * 100
    print(f"Class={k}, n={v} ({per:.2f}%)")
In [ ]:
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('LOG', LogisticRegression()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('RFC', RandomForestClassifier()))
In [ ]:
names = []
results = []
for name, model in models:
    fold = KFold(n_splits=10)
    score = cross_val_score(model, X, y, cv=fold, scoring='accuracy')
    names.append(name)
    results.append(score)
In [ ]:
plotdict = dict(zip(names, results))
for k, v in plotdict.items():
    print(f"{k}, {round(v.mean(), 5)}")
In [ ]:
top10 = ['hour', "casualties", "day_of_week", "accident_cause", "vehicles_involved", "vehicle_type", 
         "driver_age", "accident_area", "driving_experience", "lanes"]
In [ ]:
df10 = df[top10]
df10
In [ ]:
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(df10, y, test_size=0.3, random_state=42)
print(X_train_new.shape, X_test_new.shape, y_train_new.shape, y_test_new.shape)
In [ ]:
model = RandomForestClassifier()
In [ ]:
model.fit(X_train_new, y_train_new)
In [ ]:
y_pred_new = model.predict(X_test_new)
score_new = accuracy_score(y_test_new, y_pred_new)
print("Accuracy score: ", score_new)
In [ ]:
joblib.dump(model, 'random_forest_final.joblib')
In [ ]:
df.shape
Out[ ]:
(12316, 29)
In [ ]:
def imbalanced_features(df):
    mod_imb=[] # moderately imbalanced
    hi_imb=[] # highly imbalanced

    for col in df.columns:
        try:
            if (df[col].value_counts()/df.shape[0]).max() > 0.9:
                hi_imb.append(col)
            elif (df[col].value_counts()/df.shape[0]).max() > 0.7 and col not in hi_imb:
                mod_imb.append(col)
        except Exception as e:
            print(f"Couldn't check \033[1m{col}\033[0m. ", e, "\n")

    print("="*20, "\033[1m Imbalanced features\033[0m", "="*20)
    print("No of moderately imbalanced features (75%+ data on a single class): ", len(mod_imb), "\n")
    print(mod_imb, "\n")
    print("No of highly imbalanced features (90%+ data on a single class): ", len(hi_imb), "\n")
    print(hi_imb)
imbalanced_features(df)
====================  Imbalanced features ====================
No of moderately imbalanced features (75%+ data on a single class):  7 

['vehicle_owner', 'road_allignment', 'road_surface_conditions', 'light_condition', 'weather_condition', 'collision_type', 'accident_severity'] 

No of highly imbalanced features (90%+ data on a single class):  3 

['driver_sex', 'surface_type', 'pedestrian_movement']
In [ ]:
# upsampling using smote
y = df['accident_severity']
X = df.drop('accident_severity', axis=1)

counter = Counter(y)

for k,v in counter.items():
    per = 100*v/len(y)
    print(f"Class= {k}, n={v} ({per:.2f}%)")
Class= Slight Injury, n=10415 (84.56%)
Class= Serious Injury, n=1743 (14.15%)
Class= Fatal injury, n=158 (1.28%)
In [ ]:
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)

counter = Counter(y)

for k,v in counter.items():
    per = 100*v/len(y)
    print(f"Class= {k}, n={v} ({per:.2f}%)")
Class= Slight Injury, n=10415 (33.33%)
Class= Serious Injury, n=10415 (33.33%)
Class= Fatal injury, n=10415 (33.33%)
In [ ]:
df = pd.concat([X, y], axis=1)
print("Upsampled data shape: ", df.shape)
Upsampled data shape:  (31245, 29)
In [ ]:
# selecting a good baseline model using cross validation

models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('LOG', LogisticRegression()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('RFC', RandomForestClassifier()))
In [ ]:
names = []
results = []
for name, model in models:
    fold = KFold(n_splits=10)
    score = cross_val_score(model, X, y, cv=fold, scoring='accuracy')
    names.append(name)
    results.append(score)
In [ ]:
plotdict = dict(zip(names, results))

for k,v in plotdict.items():
    print(f"{k}: {round(v.mean(),5)}")
KNN: 0.79581
LOG: 0.45408
DTC: 0.81608
RFC: 0.91164
In [ ]:
model = RandomForestClassifier()
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(21871, 28) (9374, 28) (21871,) (9374,)
In [ ]:
# fitting model
model.fit(X_train, y_train)
Out[ ]:
RandomForestClassifier()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier()
In [ ]:
# predicting
y_pred = model.predict(X_test)
In [ ]:
score = accuracy_score(y_test, y_pred)
print("Accuracy: ", score)
Accuracy:  0.9270322167697888
In [ ]:
# getting feature importance
model.feature_importances_
Out[ ]:
array([0.05976835, 0.04902253, 0.00852862, 0.02963963, 0.04457838,
       0.05237394, 0.01392963, 0.03375527, 0.04351982, 0.04510356,
       0.01514861, 0.04193056, 0.00787389, 0.0302327 , 0.0413342 ,
       0.0184541 , 0.03003746, 0.05763699, 0.06237717, 0.03179664,
       0.0196426 , 0.01589697, 0.02627878, 0.01208849, 0.01006779,
       0.05598957, 0.07489044, 0.06810332])
In [ ]:
df_importance = pd.DataFrame()
df_importance['Features'] = X.columns
df_importance['Importance'] = model.feature_importances_
plt.figure(figsize=(10, 12))
sns.barplot(data = df_importance.sort_values("Importance", ascending=False), y='Features', x='Importance');
In [ ]:
# selecting top 10 features
top10 = list(df_importance.sort_values("Importance", ascending=False)['Features'].head(10).values)
top10
Out[ ]:
['hour',
 'minute',
 'casualties',
 'day_of_week',
 'vehicles_involved',
 'accident_cause',
 'vehicle_type',
 'driver_age',
 'lanes',
 'driving_experience']
In [ ]:
df_top10 = df[['lanes',
 'minute',
 'casualties',
 'day_of_week',
 'vehicles_involved',
 'accident_cause',
 'vehicle_type',
 'driver_age',
 'lanes',
 'driving_experience']]
df_top10
Out[ ]:
lanes minute casualties day_of_week vehicles_involved accident_cause vehicle_type driver_age lanes driving_experience
0 2 1 1 1 1 9 0 0 2 0
1 4 1 1 1 1 16 11 1 4 3
2 6 1 1 1 1 0 5 0 6 0
3 6 2 1 3 1 1 11 0 6 2
4 6 2 1 3 1 16 0 0 6 1
... ... ... ... ... ... ... ... ... ... ...
31240 2 6 0 2 1 0 7 0 2 3
31241 2 1 0 3 0 11 5 1 2 1
31242 4 9 2 4 1 9 0 2 4 0
31243 3 10 0 3 1 10 0 0 3 1
31244 2 3 0 1 0 2 14 3 2 0

31245 rows × 10 columns

In [ ]:
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(df_top10, y, test_size=0.3, random_state=42)
print(X_train_new.shape, X_test_new.shape, y_train_new.shape, y_test_new.shape)
(21871, 10) (9374, 10) (21871,) (9374,)
In [ ]:
# fitting model
model.fit(X_train_new, y_train_new)
Out[ ]:
RandomForestClassifier()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier()
In [ ]:
# predicting
y_pred_new = model.predict(X_test_new)
score_new = accuracy_score(y_test_new, y_pred_new)
print("Accuracy: ", score_new)
Accuracy:  0.8595050138681459
In [ ]:
joblib.dump(model, 'random_forest_final.joblib')
Out[ ]:
['random_forest_final.joblib']

💡 Observations:

A

💡 Observations:

A

💡 Observations:

A

💡 Observations:

A